1. Introduction

Music is loved around the world, and the music industry is one of the largest in it, with an estimated value of over $50 billion. So what makes a song popular? Identifying the factors that affect a song's popularity, and predicting it, can give us a better understanding of the music industry and valuable business insights. Our project focuses on building machine learning models to predict the popularity of songs on Spotify by answering two SMART questions. The first question is “Which features make a song popular?”, for which we will determine the audio features that affect popularity. The second is “Can we recommend a certain song to users who like listening to popular songs?”

This summary paper is organized as follows: 1. Introduction 2. Description of Data 3. Data Pre-Processing 4. EDA 5. Models 6. Conclusion

2. Description of Data

While looking for a songs dataset, we checked whether Spotify, one of the most widely used music streaming services, provides any data. We initially attempted to retrieve data through the Spotify API, but due to time constraints we were unable to work through it. Fortunately, we were able to get a Spotify dataset from Kaggle.
Let us pull the dataset from the file spotify_dataset.csv into a dataframe df_spotify. As this file is too large to upload to GitHub, we store it in a folder called spotify_dataset located one level above the project folder.

df_spotify <- data.frame(read.csv('spotify_dataset.csv'))
str(df_spotify)
## 'data.frame':    114000 obs. of  21 variables:
##  $ X               : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ track_id        : chr  "5SuOikwiRyPMVoIQDJUgSV" "4qPNDBW1i3p13qLCt0Ki3A" "1iJBSr7s7jYXzM8EGcbK5b" "6lfxq3CG4xtTiEg7opyCyx" ...
##  $ artists         : chr  "Gen Hoshino" "Ben Woodward" "Ingrid Michaelson;ZAYN" "Kina Grannis" ...
##  $ album_name      : chr  "Comedy" "Ghost (Acoustic)" "To Begin Again" "Crazy Rich Asians (Original Motion Picture Soundtrack)" ...
##  $ track_name      : chr  "Comedy" "Ghost - Acoustic" "To Begin Again" "Can't Help Falling In Love" ...
##  $ popularity      : int  73 55 57 71 82 58 74 80 74 56 ...
##  $ duration_ms     : int  230666 149610 210826 201933 198853 214240 229400 242946 189613 205594 ...
##  $ explicit        : chr  "False" "False" "False" "False" ...
##  $ danceability    : num  0.676 0.42 0.438 0.266 0.618 0.688 0.407 0.703 0.625 0.442 ...
##  $ energy          : num  0.461 0.166 0.359 0.0596 0.443 0.481 0.147 0.444 0.414 0.632 ...
##  $ key             : int  1 1 0 0 2 6 2 11 0 1 ...
##  $ loudness        : num  -6.75 -17.23 -9.73 -18.52 -9.68 ...
##  $ mode            : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ speechiness     : num  0.143 0.0763 0.0557 0.0363 0.0526 0.105 0.0355 0.0417 0.0369 0.0295 ...
##  $ acousticness    : num  0.0322 0.924 0.21 0.905 0.469 0.289 0.857 0.559 0.294 0.426 ...
##  $ instrumentalness: num  1.01e-06 5.56e-06 0.00 7.07e-05 0.00 0.00 2.89e-06 0.00 0.00 4.19e-03 ...
##  $ liveness        : num  0.358 0.101 0.117 0.132 0.0829 0.189 0.0913 0.0973 0.151 0.0735 ...
##  $ valence         : num  0.715 0.267 0.12 0.143 0.167 0.666 0.0765 0.712 0.669 0.196 ...
##  $ tempo           : num  87.9 77.5 76.3 181.7 119.9 ...
##  $ time_signature  : int  4 4 4 3 4 4 3 4 4 4 ...
##  $ track_genre     : chr  "acoustic" "acoustic" "acoustic" "acoustic" ...

It has 114000 observations with 21 variables.
The data entries include the following variables:
track_id: ID of track generated by Spotify
artists: artist’s name
album_name: name of albums
track_name: name of tracks
popularity: ranges from 0 to 100
duration_ms: duration of track in milliseconds; integer typically ranges from 200k to 300k
explicit: whether the track has explicit content (stored in this file as the strings "False"/"True")
danceability: ranges from 0 to 1
energy: ranges from 0 to 1
key: all keys on octave encoded as values ranging from 0 to 11, starting with C as 0, C# as 1, etc.
loudness: float typically ranging from -60 to 0
mode: indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0
speechiness: The higher the value the more spoken word the song contains; ranges from 0 to 1
acousticness: ranges from 0 to 1
instrumentalness: predicts whether a track contains no vocals; the closer the value is to 1, the fewer vocals the track contains; ranges from 0 to 1
liveness: The higher the value, the more likely the song is a live recording; ranges from 0 to 1
valence: The higher the value, the more positive mood for the song; ranges from 0 to 1
tempo: float typically ranging from 50 to 150
time_signature: an estimated time signature (meter), a notational convention specifying how many beats are in each bar (or measure); ranges from 3 to 7, indicating time signatures from “3/4” to “7/4”
track_genre: genre of track
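To make the key and mode encodings above concrete, here is a minimal sketch that maps the integer codes back to readable labels; the key_names vector and the decode_key helper are ours for illustration, not part of the dataset:

```r
# Pitch-class encoding described above: C = 0, C# = 1, ..., B = 11
key_names <- c("C", "C#", "D", "D#", "E", "F",
               "F#", "G", "G#", "A", "A#", "B")

# decode_key() is a hypothetical helper: key in 0..11, mode 1 = major, 0 = minor
decode_key <- function(key, mode) {
  paste(key_names[key + 1], ifelse(mode == 1, "major", "minor"))
}

decode_key(1, 0)  # first track in the data (key 1, mode 0): "C# minor"
```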

Our target variable is “popularity”. It ranges from 0 to 100 and is of integer type. Popularity is determined by an algorithm and depends primarily on how recently and how many times a track has been played. In general, songs that are being played often now will be more popular than songs that were played frequently in the past.

head(df_spotify)
##   X               track_id                artists
## 1 0 5SuOikwiRyPMVoIQDJUgSV            Gen Hoshino
## 2 1 4qPNDBW1i3p13qLCt0Ki3A           Ben Woodward
## 3 2 1iJBSr7s7jYXzM8EGcbK5b Ingrid Michaelson;ZAYN
## 4 3 6lfxq3CG4xtTiEg7opyCyx           Kina Grannis
## 5 4 5vjLSffimiIP26QG5WcN2K       Chord Overstreet
## 6 5 01MVOl9KtVTNfFiBU9I7dc           Tyrone Wells
##                                               album_name
## 1                                                 Comedy
## 2                                       Ghost (Acoustic)
## 3                                         To Begin Again
## 4 Crazy Rich Asians (Original Motion Picture Soundtrack)
## 5                                                Hold On
## 6                                   Days I Will Remember
##                   track_name popularity duration_ms explicit danceability
## 1                     Comedy         73      230666    False        0.676
## 2           Ghost - Acoustic         55      149610    False        0.420
## 3             To Begin Again         57      210826    False        0.438
## 4 Can't Help Falling In Love         71      201933    False        0.266
## 5                    Hold On         82      198853    False        0.618
## 6       Days I Will Remember         58      214240    False        0.688
##   energy key loudness mode speechiness acousticness instrumentalness liveness
## 1 0.4610   1   -6.746    0      0.1430       0.0322         1.01e-06   0.3580
## 2 0.1660   1  -17.235    1      0.0763       0.9240         5.56e-06   0.1010
## 3 0.3590   0   -9.734    1      0.0557       0.2100         0.00e+00   0.1170
## 4 0.0596   0  -18.515    1      0.0363       0.9050         7.07e-05   0.1320
## 5 0.4430   2   -9.681    1      0.0526       0.4690         0.00e+00   0.0829
## 6 0.4810   6   -8.807    1      0.1050       0.2890         0.00e+00   0.1890
##   valence   tempo time_signature track_genre
## 1   0.715  87.917              4    acoustic
## 2   0.267  77.489              4    acoustic
## 3   0.120  76.332              4    acoustic
## 4   0.143 181.740              3    acoustic
## 5   0.167 119.949              4    acoustic
## 6   0.666  98.017              4    acoustic
tail(df_spotify)
##             X               track_id          artists
## 113995 113994 4WbOUe6T0sozC7z5ZJgiAA   Lucas Cervetti
## 113996 113995 2C3TZjDRiAzdyViavDJ217    Rainy Lullaby
## 113997 113996 1hIz5L4IB9hN3WRYPOCGPw    Rainy Lullaby
## 113998 113997 6x8ZfSoqDjuNa5SVP5QjvX    Cesária Evora
## 113999 113998 2e6sXL2bYv4bSz6VTdnfLs Michael W. Smith
## 114000 113999 2hETkH7cOfqmz3LqZDHZf5    Cesária Evora
##                                                                             album_name
## 113995                                                    Frecuencias Álmicas en 432hz
## 113996 #mindfulness - Soft Rain for Mindful Meditation, Stress Relief Relaxation Music
## 113997 #mindfulness - Soft Rain for Mindful Meditation, Stress Relief Relaxation Music
## 113998                                                                         Best Of
## 113999                                                               Change Your World
## 114000                                                                  Miss Perfumado
##                      track_name popularity duration_ms explicit danceability
## 113995 Frecuencia Álmica, Pt. 4         22      305454    False        0.331
## 113996      Sleep My Little Boy         21      384999    False        0.172
## 113997         Water Into Light         22      385000    False        0.174
## 113998           Miss Perfumado         22      271466    False        0.629
## 113999                  Friends         41      283893    False        0.587
## 114000                Barbincor         22      241826    False        0.526
##        energy key loudness mode speechiness acousticness instrumentalness
## 113995  0.171   1  -15.668    1      0.0350        0.920           0.0229
## 113996  0.235   5  -16.393    1      0.0422        0.640           0.9280
## 113997  0.117   0  -18.318    0      0.0401        0.994           0.9760
## 113998  0.329   0  -10.895    0      0.0420        0.867           0.0000
## 113999  0.506   7  -10.889    1      0.0297        0.381           0.0000
## 114000  0.487   1  -10.204    0      0.0725        0.681           0.0000
##        liveness valence   tempo time_signature track_genre
## 113995   0.0679  0.3270 132.147              3 world-music
## 113996   0.0863  0.0339 125.995              5 world-music
## 113997   0.1050  0.0350  85.239              4 world-music
## 113998   0.0839  0.7430 132.378              4 world-music
## 113999   0.2700  0.4130 135.960              4 world-music
## 114000   0.0893  0.7080  79.198              4 world-music
summary(df_spotify)
##        X            track_id           artists           album_name       
##  Min.   :     0   Length:114000      Length:114000      Length:114000     
##  1st Qu.: 28500   Class :character   Class :character   Class :character  
##  Median : 57000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 57000                                                           
##  3rd Qu.: 85499                                                           
##  Max.   :113999                                                           
##   track_name          popularity      duration_ms        explicit        
##  Length:114000      Min.   :  0.00   Min.   :      0   Length:114000     
##  Class :character   1st Qu.: 17.00   1st Qu.: 174066   Class :character  
##  Mode  :character   Median : 35.00   Median : 212906   Mode  :character  
##                     Mean   : 33.24   Mean   : 228029                     
##                     3rd Qu.: 50.00   3rd Qu.: 261506                     
##                     Max.   :100.00   Max.   :5237295                     
##   danceability        energy            key            loudness      
##  Min.   :0.0000   Min.   :0.0000   Min.   : 0.000   Min.   :-49.531  
##  1st Qu.:0.4560   1st Qu.:0.4720   1st Qu.: 2.000   1st Qu.:-10.013  
##  Median :0.5800   Median :0.6850   Median : 5.000   Median : -7.004  
##  Mean   :0.5668   Mean   :0.6414   Mean   : 5.309   Mean   : -8.259  
##  3rd Qu.:0.6950   3rd Qu.:0.8540   3rd Qu.: 8.000   3rd Qu.: -5.003  
##  Max.   :0.9850   Max.   :1.0000   Max.   :11.000   Max.   :  4.532  
##       mode         speechiness       acousticness    instrumentalness  
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00e+00  
##  1st Qu.:0.0000   1st Qu.:0.03590   1st Qu.:0.0169   1st Qu.:0.00e+00  
##  Median :1.0000   Median :0.04890   Median :0.1690   Median :4.16e-05  
##  Mean   :0.6376   Mean   :0.08465   Mean   :0.3149   Mean   :1.56e-01  
##  3rd Qu.:1.0000   3rd Qu.:0.08450   3rd Qu.:0.5980   3rd Qu.:4.90e-02  
##  Max.   :1.0000   Max.   :0.96500   Max.   :0.9960   Max.   :1.00e+00  
##     liveness         valence           tempo        time_signature 
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :0.000  
##  1st Qu.:0.0980   1st Qu.:0.2600   1st Qu.: 99.22   1st Qu.:4.000  
##  Median :0.1320   Median :0.4640   Median :122.02   Median :4.000  
##  Mean   :0.2136   Mean   :0.4741   Mean   :122.15   Mean   :3.904  
##  3rd Qu.:0.2730   3rd Qu.:0.6830   3rd Qu.:140.07   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :0.9950   Max.   :243.37   Max.   :5.000  
##  track_genre       
##  Length:114000     
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

First, let’s check the percentage of NA’s present in each column of the dataset.

(colMeans(is.na(df_spotify)))*100
##                X         track_id          artists       album_name 
##                0                0                0                0 
##       track_name       popularity      duration_ms         explicit 
##                0                0                0                0 
##     danceability           energy              key         loudness 
##                0                0                0                0 
##             mode      speechiness     acousticness instrumentalness 
##                0                0                0                0 
##         liveness          valence            tempo   time_signature 
##                0                0                0                0 
##      track_genre 
##                0
library(dplyr)     # provides %>% and select_if()
library(corrplot)  # provides corrplot() below

data_corr <- df_spotify %>% select_if(is.numeric)
data_corr <- subset(data_corr, select = -c(X))

corrplot(cor(data_corr),
         type = "upper",
         col = colorRampPalette(c("white", "deepskyblue", "blue4"))(100),
         is.corr = TRUE,
         order = "hclust",
         tl.col = "black",
         diag = TRUE)

We haven’t dropped any observations since there are no NA values. The columns ‘track_id’ and ‘X’ were excluded from the correlation data since they do not carry any useful information.

3. Data Pre-Processing

In the data pre-processing step, we perform the following operations:

  1. Convert the column duration_ms into duration_min.
  2. Replace all unwanted symbols in the artists column with blanks.
df_spotify <- df_spotify %>%
  mutate(duration_min = round(duration_ms / 60000, digits = 2))  # ms -> minutes
head(df_spotify)
##   X               track_id                artists
## 1 0 5SuOikwiRyPMVoIQDJUgSV            Gen Hoshino
## 2 1 4qPNDBW1i3p13qLCt0Ki3A           Ben Woodward
## 3 2 1iJBSr7s7jYXzM8EGcbK5b Ingrid Michaelson;ZAYN
## 4 3 6lfxq3CG4xtTiEg7opyCyx           Kina Grannis
## 5 4 5vjLSffimiIP26QG5WcN2K       Chord Overstreet
## 6 5 01MVOl9KtVTNfFiBU9I7dc           Tyrone Wells
##                                               album_name
## 1                                                 Comedy
## 2                                       Ghost (Acoustic)
## 3                                         To Begin Again
## 4 Crazy Rich Asians (Original Motion Picture Soundtrack)
## 5                                                Hold On
## 6                                   Days I Will Remember
##                   track_name popularity duration_ms explicit danceability
## 1                     Comedy         73      230666    False        0.676
## 2           Ghost - Acoustic         55      149610    False        0.420
## 3             To Begin Again         57      210826    False        0.438
## 4 Can't Help Falling In Love         71      201933    False        0.266
## 5                    Hold On         82      198853    False        0.618
## 6       Days I Will Remember         58      214240    False        0.688
##   energy key loudness mode speechiness acousticness instrumentalness liveness
## 1 0.4610   1   -6.746    0      0.1430       0.0322         1.01e-06   0.3580
## 2 0.1660   1  -17.235    1      0.0763       0.9240         5.56e-06   0.1010
## 3 0.3590   0   -9.734    1      0.0557       0.2100         0.00e+00   0.1170
## 4 0.0596   0  -18.515    1      0.0363       0.9050         7.07e-05   0.1320
## 5 0.4430   2   -9.681    1      0.0526       0.4690         0.00e+00   0.0829
## 6 0.4810   6   -8.807    1      0.1050       0.2890         0.00e+00   0.1890
##   valence   tempo time_signature track_genre duration_min
## 1   0.715  87.917              4    acoustic         3.84
## 2   0.267  77.489              4    acoustic         2.49
## 3   0.120  76.332              4    acoustic         3.51
## 4   0.143 181.740              3    acoustic         3.37
## 5   0.167 119.949              4    acoustic         3.31
## 6   0.666  98.017              4    acoustic         3.57
# Remove the unwanted [, ], and ' characters anywhere in the artists column
df_spotify$artists <- gsub("\\[|\\]|'", "", df_spotify$artists)
head(df_spotify)
##   X               track_id                artists
## 1 0 5SuOikwiRyPMVoIQDJUgSV            Gen Hoshino
## 2 1 4qPNDBW1i3p13qLCt0Ki3A           Ben Woodward
## 3 2 1iJBSr7s7jYXzM8EGcbK5b Ingrid Michaelson;ZAYN
## 4 3 6lfxq3CG4xtTiEg7opyCyx           Kina Grannis
## 5 4 5vjLSffimiIP26QG5WcN2K       Chord Overstreet
## 6 5 01MVOl9KtVTNfFiBU9I7dc           Tyrone Wells
##                                               album_name
## 1                                                 Comedy
## 2                                       Ghost (Acoustic)
## 3                                         To Begin Again
## 4 Crazy Rich Asians (Original Motion Picture Soundtrack)
## 5                                                Hold On
## 6                                   Days I Will Remember
##                   track_name popularity duration_ms explicit danceability
## 1                     Comedy         73      230666    False        0.676
## 2           Ghost - Acoustic         55      149610    False        0.420
## 3             To Begin Again         57      210826    False        0.438
## 4 Can't Help Falling In Love         71      201933    False        0.266
## 5                    Hold On         82      198853    False        0.618
## 6       Days I Will Remember         58      214240    False        0.688
##   energy key loudness mode speechiness acousticness instrumentalness liveness
## 1 0.4610   1   -6.746    0      0.1430       0.0322         1.01e-06   0.3580
## 2 0.1660   1  -17.235    1      0.0763       0.9240         5.56e-06   0.1010
## 3 0.3590   0   -9.734    1      0.0557       0.2100         0.00e+00   0.1170
## 4 0.0596   0  -18.515    1      0.0363       0.9050         7.07e-05   0.1320
## 5 0.4430   2   -9.681    1      0.0526       0.4690         0.00e+00   0.0829
## 6 0.4810   6   -8.807    1      0.1050       0.2890         0.00e+00   0.1890
##   valence   tempo time_signature track_genre duration_min
## 1   0.715  87.917              4    acoustic         3.84
## 2   0.267  77.489              4    acoustic         2.49
## 3   0.120  76.332              4    acoustic         3.51
## 4   0.143 181.740              3    acoustic         3.37
## 5   0.167 119.949              4    acoustic         3.31
## 6   0.666  98.017              4    acoustic         3.57

4. EDA

Let us begin our EDA to answer the first question, “Which features make a song popular?”.

popHist <- ggplot(df_spotify, aes(x=popularity)) + geom_histogram(color="black", fill="steelblue1", alpha=0.9, bins = 101)+
  ggtitle("Histogram of Popularity")
popHist

From the histogram of popularity, we can see that many songs have a popularity of 0, which makes 0 the mode. Most of the mass sits on the left-hand side with a long tail to the right, so the distribution is right-skewed.

# Basic histograms & scatterplots for the other variables not included in our SMART questions

danceHist <- ggplot(df_spotify, aes(x=danceability)) + geom_histogram(color="black", fill="steelblue1", alpha=0.9, bins = 150)+
  ggtitle("Histogram of Danceability")
danceHist

danceSct<- ggplot(df_spotify, aes(x=popularity, y= danceability)) +
  geom_point(colour="steelblue1",size = 0.1) +
  ggtitle("Scatterplot for Danceability")
danceSct

acHist <- ggplot(df_spotify, aes(x=acousticness)) + geom_histogram(color="black", fill="steelblue1", alpha=0.9, bins = 100)+
  ggtitle("Histogram of Acousticness")
acHist

acSct<- ggplot(df_spotify, aes(x=popularity, y= acousticness)) +
  geom_point(colour="steelblue1",size = 0.1) +
  ggtitle("Scatterplot for Acousticness")
acSct

energyHist <- ggplot(df_spotify, aes(x=energy)) + geom_histogram(color="black", fill="steelblue1", alpha=0.9, bins = 100)+
  ggtitle("Histogram of Energy")
energyHist

energySct<- ggplot(df_spotify, aes(x=popularity, y= energy)) +
  geom_point(colour="steelblue1",size = 0.1) +
  ggtitle("Scatterplot for Energy")
energySct

speechinessHist <- ggplot(df_spotify, aes(x=speechiness)) + geom_histogram(color="black", fill="steelblue1", alpha=0.9, bins = 100)+
  ggtitle("Histogram of speechiness")
speechinessHist

speechinessSct<- ggplot(df_spotify, aes(x=popularity, y= speechiness)) +
  geom_point(colour="steelblue1",size = 0.1) +
  ggtitle("Scatterplot for speechiness")
speechinessSct

instrumentalnessHist <- ggplot(df_spotify, aes(x=instrumentalness)) + geom_histogram(color="black",fill="steelblue1", alpha=0.9, bins = 50)+
  ggtitle("Histogram of instrumentalness")
instrumentalnessHist

instrumentalnessSctr<- ggplot(df_spotify, aes(x=popularity, y= instrumentalness)) +
  geom_point(colour="steelblue1",size = 0.1) +
  ggtitle("Scatterplot for instrumentalness")
instrumentalnessSctr

livenessHist <- ggplot(df_spotify, aes(x=liveness)) + geom_histogram(color="black",fill="steelblue1", alpha=0.9, bins = 100)+
  ggtitle("Histogram of liveness")
livenessHist

livenessSct<- ggplot(df_spotify, aes(x=popularity, y= liveness)) +
  geom_point(colour="steelblue1",size = 0.1) +
  ggtitle("Scatterplot for liveness")
livenessSct

valenceHist <- ggplot(df_spotify, aes(x=valence)) + geom_histogram(color="black",fill="steelblue1", alpha=0.9, bins = 100)+
  ggtitle("Histogram of valence")
valenceHist

valenceSctr<- ggplot(df_spotify, aes(x=popularity, y= valence)) +
  geom_point(colour="steelblue1",size = 0.1) +
  ggtitle("Scatterplot for valence")
valenceSctr

tempoHist <- ggplot(df_spotify, aes(x=tempo)) + geom_histogram(color="black",fill="steelblue1", alpha=0.9, bins = 100)+
  ggtitle("Histogram of tempo")
tempoHist

tempoSctr<- ggplot(df_spotify, aes(x=popularity, y= tempo)) +
  geom_point(colour="steelblue1",size = 0.1) +
  ggtitle("Scatterplot for tempo")
tempoSctr

The histograms show the distribution of each variable. Several are right-skewed, including Acousticness, Speechiness, Liveness, and Instrumentalness, while Danceability and Energy are left-skewed. Tempo looks bimodal, and Valence looks roughly uniform. The scatterplots show how each variable is distributed against popularity.
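These skew readings can be backed with a quick sample-skewness check. The following is a standalone sketch on synthetic data, not a computation on the Spotify columns; the skewness helper is ours:

```r
# Third standardized moment: positive for right skew, negative for left skew
skewness <- function(x) {
  m <- mean(x); s <- sd(x)
  mean(((x - m) / s)^3)
}

set.seed(42)
right_tail <- rexp(10000)   # long right tail, like speechiness or liveness
left_tail  <- -right_tail   # mirrored: long left tail, like energy
skewness(right_tail) > 0    # TRUE
skewness(left_tail) < 0     # TRUE
```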

library("ggpubr")
histfig <- ggarrange(danceHist,acHist,energyHist, speechinessHist, instrumentalnessHist,livenessHist, valenceHist,tempoHist, ncol = 3, nrow = 2)
histfig
## $`1`

## 
## $`2`

## 
## attr(,"class")
## [1] "list"      "ggarrange"
sctfig <- ggarrange(danceSct, acSct,energySct, speechinessSct, instrumentalnessSctr,livenessSct, valenceSctr,tempoSctr, ncol = 3, nrow = 2)
sctfig
## $`1`

## 
## $`2`

## 
## attr(,"class")
## [1] "list"      "ggarrange"

Here you can see the correlation plot for the features of a song. Many features are correlated with each other, for example energy and loudness, or danceability and valence. The target variable, popularity, does not have a strong relationship with any single feature, though there is some relationship with, for instance, instrumentalness, danceability, and valence. We will examine these relationships in depth in the sections ahead.
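As a standalone illustration of what the correlation plot encodes, two positively related variables, like energy and loudness in the data, yield a correlation near 1. The variables below are synthetic stand-ins, not the Spotify columns:

```r
set.seed(1)
energy   <- runif(500)                 # synthetic stand-in on [0, 1]
loudness <- 10 * energy + rnorm(500)   # louder tracks tend to be more energetic
cor(energy, loudness)                  # strong positive correlation
```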

data_corr<-df_spotify[!duplicated(df_spotify),]
data_corr = subset(data_corr, select = -c(track_id,X,artists,album_name,track_name,explicit,track_genre) )
# One helper instead of eleven copies of the same ggplot call;
# density instead of count on the y-axis, with an overlaid density curve
dens_hist <- function(var) {
  ggplot(data_corr, aes(x = .data[[var]])) +
    geom_histogram(aes(y = ..density..),
                   binwidth = .5,
                   colour = "steelblue1", fill = "white") +
    geom_density(alpha = .2, fill = "steelblue1")
}

valence      <- dens_hist("valence")
tempo        <- dens_hist("tempo")
acousticness <- dens_hist("acousticness")
danceability <- dens_hist("danceability")
speechiness  <- dens_hist("speechiness")
mode         <- dens_hist("mode")
liveness     <- dens_hist("liveness")
loudness     <- dens_hist("loudness")
popularity   <- dens_hist("popularity")
energy       <- dens_hist("energy")
key          <- dens_hist("key")
library(ggpubr)
figure <- ggarrange(valence, tempo, acousticness,speechiness, danceability,mode,liveness,loudness,popularity,energy, key, ncol = 3, nrow = 5)
annotate_figure(figure, top = text_grob("Density Histogram for all variables", 
               color = "black", face = "bold", size = 14))

From the distribution plots, we can observe that columns such as energy and valence have high variability, whereas liveness, danceability, tempo, and loudness have relatively lower variability and are concentrated in narrower ranges.

ggplot(data_corr, aes(x=energy, y=popularity)) + geom_point(col = "steelblue1") 

ggplot(data_corr, aes(x=acousticness, y=popularity)) + geom_point(col = "steelblue1") 

ggplot(data_corr, aes(x=loudness, y=popularity)) + geom_point(col = "steelblue1")

data_corr <- data_corr %>% mutate(popularity_t = if_else(popularity > 50, 1, 0))
head(data_corr)
##   popularity duration_ms danceability energy key loudness mode speechiness
## 1         73      230666        0.676 0.4610   1   -6.746    0      0.1430
## 2         55      149610        0.420 0.1660   1  -17.235    1      0.0763
## 3         57      210826        0.438 0.3590   0   -9.734    1      0.0557
## 4         71      201933        0.266 0.0596   0  -18.515    1      0.0363
## 5         82      198853        0.618 0.4430   2   -9.681    1      0.0526
## 6         58      214240        0.688 0.4810   6   -8.807    1      0.1050
##   acousticness instrumentalness liveness valence   tempo time_signature
## 1       0.0322         1.01e-06   0.3580   0.715  87.917              4
## 2       0.9240         5.56e-06   0.1010   0.267  77.489              4
## 3       0.2100         0.00e+00   0.1170   0.120  76.332              4
## 4       0.9050         7.07e-05   0.1320   0.143 181.740              3
## 5       0.4690         0.00e+00   0.0829   0.167 119.949              4
## 6       0.2890         0.00e+00   0.1890   0.666  98.017              4
##   duration_min popularity_t
## 1         3.84            1
## 2         2.49            1
## 3         3.51            1
## 4         3.37            1
## 5         3.31            1
## 6         3.57            1

We have added a new column called popularity_t, derived from the popularity feature, to give a clear answer to our SMART questions. The top 25% most popular songs (popularity above 50, the third quartile) take the value 1 and the rest take 0.
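The 50-point cutoff matches the third quartile in the summary above. A sketch that derives the threshold from the data itself, rather than hard-coding it, could look like this; the short popularity vector is synthetic, standing in for the real column:

```r
# Derive the top-25% cutoff instead of hard-coding 50
popularity <- c(0, 10, 17, 35, 42, 50, 63, 88)   # synthetic scores
cutoff <- quantile(popularity, 0.75)             # third quartile (53.25 here)
popularity_t <- ifelse(popularity > cutoff, 1, 0)
sum(popularity_t)   # 2 songs land in the top 25%
```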

ggplot(data_corr, aes(x = factor(popularity_t))) +
  geom_bar(color = "black", fill = c("steelblue3", "blue4")) +
  ggtitle("Popularity_t Distribution")

This bar plot shows the new popularity_t column: songs in the top 25% of popularity take the value 1, and the rest take 0. We can observe that most songs fall into the 0 category.

# t-tests comparing the numeric variables across the two popularity_t groups
pop1 <- subset(data_corr, popularity_t == 1)
pop2 <- subset(data_corr, popularity_t == 0)
str(pop1)
## 'data.frame':    27770 obs. of  16 variables:
##  $ popularity      : int  73 55 57 71 82 58 74 80 74 56 ...
##  $ duration_ms     : int  230666 149610 210826 201933 198853 214240 229400 242946 189613 205594 ...
##  $ danceability    : num  0.676 0.42 0.438 0.266 0.618 0.688 0.407 0.703 0.625 0.442 ...
##  $ energy          : num  0.461 0.166 0.359 0.0596 0.443 0.481 0.147 0.444 0.414 0.632 ...
##  $ key             : int  1 1 0 0 2 6 2 11 0 1 ...
##  $ loudness        : num  -6.75 -17.23 -9.73 -18.52 -9.68 ...
##  $ mode            : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ speechiness     : num  0.143 0.0763 0.0557 0.0363 0.0526 0.105 0.0355 0.0417 0.0369 0.0295 ...
##  $ acousticness    : num  0.0322 0.924 0.21 0.905 0.469 0.289 0.857 0.559 0.294 0.426 ...
##  $ instrumentalness: num  1.01e-06 5.56e-06 0.00 7.07e-05 0.00 0.00 2.89e-06 0.00 0.00 4.19e-03 ...
##  $ liveness        : num  0.358 0.101 0.117 0.132 0.0829 0.189 0.0913 0.0973 0.151 0.0735 ...
##  $ valence         : num  0.715 0.267 0.12 0.143 0.167 0.666 0.0765 0.712 0.669 0.196 ...
##  $ tempo           : num  87.9 77.5 76.3 181.7 119.9 ...
##  $ time_signature  : int  4 4 4 3 4 4 3 4 4 4 ...
##  $ duration_min    : num  3.84 2.49 3.51 3.37 3.31 3.57 3.82 4.05 3.16 3.43 ...
##  $ popularity_t    : num  1 1 1 1 1 1 1 1 1 1 ...
str(pop2)
## 'data.frame':    86230 obs. of  16 variables:
##  $ popularity      : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ duration_ms     : int  216386 231266 302346 131760 273653 131760 131760 131760 131760 234186 ...
##  $ danceability    : num  0.572 0.796 0.755 0.62 0.633 0.62 0.62 0.62 0.62 0.593 ...
##  $ energy          : num  0.454 0.667 0.454 0.309 0.429 0.309 0.309 0.309 0.309 0.455 ...
##  $ key             : int  3 5 9 5 4 5 5 5 5 6 ...
##  $ loudness        : num  -10.29 -4.83 -9.61 -9.21 -6.78 ...
##  $ mode            : int  1 0 0 1 0 1 1 1 1 1 ...
##  $ speechiness     : num  0.0258 0.0392 0.0352 0.0495 0.0381 0.0495 0.0495 0.0495 0.0495 0.0388 ...
##  $ acousticness    : num  0.477 0.381 0.757 0.788 0.0444 0.788 0.788 0.788 0.788 0.366 ...
##  $ instrumentalness: num  1.37e-05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ...
##  $ liveness        : num  0.0974 0.221 0.236 0.146 0.132 0.146 0.146 0.146 0.146 0.0914 ...
##  $ valence         : num  0.515 0.754 0.33 0.664 0.52 0.664 0.664 0.664 0.664 0.564 ...
##  $ tempo           : num  140 98 120 145 144 ...
##  $ time_signature  : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ duration_min    : num  3.61 3.85 5.04 2.2 4.56 2.2 2.2 2.2 2.2 3.9 ...
##  $ popularity_t    : num  0 0 0 0 0 0 0 0 0 0 ...
ttest_dance <- t.test(pop1$danceability, pop2$danceability)
ttest_dance
## 
##  Welch Two Sample t-test
## 
## data:  pop1$danceability and pop2$danceability
## t = 18.415, df = 50827, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.01880043 0.02327923
## sample estimates:
## mean of x mean of y 
## 0.5827147 0.5616748
ttest_energy <- t.test(pop1$energy, pop2$energy)
ttest_energy
## 
##  Welch Two Sample t-test
## 
## data:  pop1$energy and pop2$energy
## t = -7.7581, df = 49513, p-value = 8.785e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.016347809 -0.009753551
## sample estimates:
## mean of x mean of y 
## 0.6315112 0.6445619
ttest_loudness <- t.test(pop1$loudness, pop2$loudness)
ttest_loudness
## 
##  Welch Two Sample t-test
## 
## data:  pop1$loudness and pop2$loudness
## t = 11.445, df = 46058, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3329218 0.4705143
## sample estimates:
## mean of x mean of y 
## -7.955099 -8.356817
ttest_speechiness <- t.test(pop1$speechiness, pop2$speechiness)
ttest_speechiness
## 
##  Welch Two Sample t-test
## 
## data:  pop1$speechiness and pop2$speechiness
## t = -20.368, df = 70492, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01333412 -0.01099314
## sample estimates:
##  mean of x  mean of y 
## 0.07545150 0.08761513
ttest_acousticness <- t.test(pop1$acousticness, pop2$acousticness)
ttest_acousticness
## 
##  Welch Two Sample t-test
## 
## data:  pop1$acousticness and pop2$acousticness
## t = -12.394, df = 49915, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.03176481 -0.02309010
## sample estimates:
## mean of x mean of y 
## 0.2941638 0.3215913
ttest_instrumentalness <- t.test(pop1$instrumentalness, pop2$instrumentalness)
ttest_instrumentalness
## 
##  Welch Two Sample t-test
## 
## data:  pop1$instrumentalness and pop2$instrumentalness
## t = -30.628, df = 56478, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.06276323 -0.05521336
## sample estimates:
## mean of x mean of y 
## 0.1114306 0.1704189
ttest_liveness <- t.test(pop1$liveness, pop2$liveness)
ttest_liveness
## 
##  Welch Two Sample t-test
## 
## data:  pop1$liveness and pop2$liveness
## t = -34.294, df = 63174, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.04074555 -0.03633985
## sample estimates:
## mean of x mean of y 
## 0.1843990 0.2229417
ttest_valence <- t.test(pop1$valence, pop2$valence)
ttest_valence
## 
##  Welch Two Sample t-test
## 
## data:  pop1$valence and pop2$valence
## t = -14.323, df = 51425, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.02763238 -0.02098022
## sample estimates:
## mean of x mean of y 
## 0.4556829 0.4799892
ttest_tempo <- t.test(pop1$tempo, pop2$tempo)
ttest_tempo
## 
##  Welch Two Sample t-test
## 
## data:  pop1$tempo and pop2$tempo
## t = -4.5327, df = 48562, p-value = 5.837e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3165064 -0.5216582
## sample estimates:
## mean of x mean of y 
##  121.4526  122.3717
ttest_duration_ms <- t.test(pop1$duration_ms, pop2$duration_ms)
ttest_duration_ms
## 
##  Welch Two Sample t-test
## 
## data:  pop1$duration_ms and pop2$duration_ms
## t = -17.215, df = 73865, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11405.089  -9073.481
## sample estimates:
## mean of x mean of y 
##  220284.1  230523.4

After creating the popularity_t column, we split the data into two groups by popularity level. To compare the means of the numerical variables between the two groups, we performed a Welch two-sample t-test on each variable. H0: the mean of the numerical variable is the same across the two popularity levels. H1: the means are not the same. For every numerical variable tested, the p-value is small enough to reject the null hypothesis, so the means differ between popular and less popular songs. We can therefore conclude that all of these numerical variables are associated with popularity, although for some of them (e.g., tempo and valence) the mean differences are small in absolute terms.
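The per-variable t.test() calls above can be condensed into a single loop that collects all p-values at once. A minimal sketch, run here on a toy data frame standing in for data_corr (the column names mirror two of the real features):

```r
# Toy stand-in for the popular/unpopular split used in the report.
set.seed(1)
toy <- data.frame(danceability = runif(100), energy = runif(100),
                  popularity_t = rep(c(1, 0), each = 50))

# One Welch t-test per feature, popular group vs. the rest.
features <- c("danceability", "energy")
pvals <- sapply(features, function(v) {
  t.test(toy[toy$popularity_t == 1, v],
         toy[toy$popularity_t == 0, v])$p.value
})
round(pvals, 3)  # named vector of p-values, one per feature
```

On the full dataset, the same loop over all numeric columns reproduces the individual tests shown above in a few lines.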

str(data_corr)
## 'data.frame':    114000 obs. of  16 variables:
##  $ popularity      : int  73 55 57 71 82 58 74 80 74 56 ...
##  $ duration_ms     : int  230666 149610 210826 201933 198853 214240 229400 242946 189613 205594 ...
##  $ danceability    : num  0.676 0.42 0.438 0.266 0.618 0.688 0.407 0.703 0.625 0.442 ...
##  $ energy          : num  0.461 0.166 0.359 0.0596 0.443 0.481 0.147 0.444 0.414 0.632 ...
##  $ key             : int  1 1 0 0 2 6 2 11 0 1 ...
##  $ loudness        : num  -6.75 -17.23 -9.73 -18.52 -9.68 ...
##  $ mode            : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ speechiness     : num  0.143 0.0763 0.0557 0.0363 0.0526 0.105 0.0355 0.0417 0.0369 0.0295 ...
##  $ acousticness    : num  0.0322 0.924 0.21 0.905 0.469 0.289 0.857 0.559 0.294 0.426 ...
##  $ instrumentalness: num  1.01e-06 5.56e-06 0.00 7.07e-05 0.00 0.00 2.89e-06 0.00 0.00 4.19e-03 ...
##  $ liveness        : num  0.358 0.101 0.117 0.132 0.0829 0.189 0.0913 0.0973 0.151 0.0735 ...
##  $ valence         : num  0.715 0.267 0.12 0.143 0.167 0.666 0.0765 0.712 0.669 0.196 ...
##  $ tempo           : num  87.9 77.5 76.3 181.7 119.9 ...
##  $ time_signature  : int  4 4 4 3 4 4 3 4 4 4 ...
##  $ duration_min    : num  3.84 2.49 3.51 3.37 3.31 3.57 3.82 4.05 3.16 3.43 ...
##  $ popularity_t    : num  1 1 1 1 1 1 1 1 1 1 ...
unique(data_corr$popularity_t)
## [1] 1 0
poptfactor <- data_corr
poptfactor$popularity_t <- as.factor(data_corr$popularity_t)

boxt <- ggplot(poptfactor, aes(x=popularity_t, y=tempo, fill=popularity_t)) + 
  geom_boxplot() +
  ggtitle("boxplot for tempo")

boxdu <- ggplot(poptfactor, aes(x=popularity_t, y=duration_ms, fill=popularity_t)) + 
  geom_boxplot() +
  ggtitle("boxplot for duration_ms")

boxv <- ggplot(poptfactor, aes(x=popularity_t, y=valence, fill=popularity_t)) + 
  geom_boxplot(col="black", fill=c("steelblue1","blue4")) +
  ggtitle("boxplot for valence")

boxi <- ggplot(poptfactor, aes(x=popularity_t, y=instrumentalness, fill=popularity_t)) + 
  geom_boxplot(col="black", fill=c("steelblue1","blue4")) +
  ggtitle("boxplot for instrumentalness")

boxa <- ggplot(poptfactor, aes(x=popularity_t, y=acousticness, fill=popularity_t)) + 
  geom_boxplot(col="black", fill=c("steelblue1","blue4")) +
  ggtitle("boxplot for acousticness")

boxda <- ggplot(poptfactor, aes(x=popularity_t, y=danceability, fill=popularity_t)) + 
  geom_boxplot(col="black", fill=c("steelblue1","blue4")) +
  ggtitle("boxplot for danceability")

boxe <- ggplot(poptfactor, aes(x=popularity_t, y=energy, fill=popularity_t)) + 
  geom_boxplot(col="black", fill=c("steelblue1","blue4")) +
  ggtitle("boxplot for energy")

boxlo <- ggplot(poptfactor, aes(x=popularity_t, y=loudness, fill=popularity_t)) + 
  geom_boxplot(col="black", fill=c("steelblue1","blue4")) +
  ggtitle("boxplot for loudness")

boxs <- ggplot(poptfactor, aes(x=popularity_t, y=speechiness, fill=popularity_t)) + 
  geom_boxplot(col="black", fill=c("steelblue1","blue4")) +
  ggtitle("boxplot for speechiness")

boxli <- ggplot(poptfactor, aes(x=popularity_t, y=liveness, fill=popularity_t)) + 
  geom_boxplot(col="black", fill=c("steelblue1","blue4")) +
  ggtitle("boxplot for liveness")

These are boxplots of each numerical variable, separated by popularity level.

library("ggpubr")
boxforvar <- ggarrange(boxt, boxv,boxi,boxa,boxda,boxe,boxlo,boxs, boxli, ncol = 3, nrow = 4)
boxforvar

boxt

boxv

boxi

boxa

boxda

boxe

boxlo

boxs

boxli

boxforvar <- ggarrange(boxt, boxdu, boxv,boxi,boxa,boxda,boxe,boxlo,boxs, boxli, ncol = 3, nrow = 4)
boxforvar

chi_key <- chisq.test(table(data_corr$popularity_t, data_corr$key))
chi_key
## 
##  Pearson's Chi-squared test
## 
## data:  table(data_corr$popularity_t, data_corr$key)
## X-squared = 190.31, df = 11, p-value < 2.2e-16
chi_mode <- chisq.test(table(data_corr$popularity_t, data_corr$mode))
chi_mode
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(data_corr$popularity_t, data_corr$mode)
## X-squared = 78.261, df = 1, p-value < 2.2e-16
chi_time_signature <- chisq.test(table(data_corr$popularity_t, data_corr$time_signature))
chi_time_signature
## 
##  Pearson's Chi-squared test
## 
## data:  table(data_corr$popularity_t, data_corr$time_signature)
## X-squared = 182.05, df = 4, p-value < 2.2e-16

Since popularity_t is a categorical value, we performed chi-squared tests of independence against the categorical variables key, mode, and time signature.
The null hypothesis in each case is that whether a song is popular or not is independent of key/mode/time signature; the alternative hypothesis is that they are NOT independent. In all three tests, the p-values are small enough to reject the null, which means popularity is dependent on each of these variables.
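With n = 114,000, even a very weak association yields p < 2.2e-16, so an effect-size measure such as Cramér's V helps judge practical relevance alongside the chi-squared tests. A minimal sketch, with a hand-rolled helper (cramers_v is our own illustrative function, not part of base R) run on simulated, unrelated variables:

```r
# Cramér's V: sqrt(chi-squared / (n * (min(rows, cols) - 1))), ranges 0 to 1.
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n    <- sum(tab)
  k    <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

# Independent toy variables shaped like popularity_t and key.
set.seed(1)
pop_t <- sample(0:1, 1000, replace = TRUE)
key   <- sample(0:11, 1000, replace = TRUE)
cramers_v(pop_t, key)  # close to 0 for unrelated variables
```

Applied to the real table(data_corr$popularity_t, data_corr$key), a small V would indicate that the significant chi-squared result reflects sample size more than a strong dependence.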

unique(pop1$key)
##  [1]  1  0  2  6 11  8  4  7  3 10  5  9
unique(pop1$mode)
## [1] 0 1
unique(pop1$time_signature)
## [1] 4 3 1 5 0
mode1 <- subset(data_corr, mode == 1)
mode0 <- subset(data_corr, mode == 0)
str(mode1)
## 'data.frame':    72681 obs. of  16 variables:
##  $ popularity      : int  55 57 71 82 58 74 80 74 56 74 ...
##  $ duration_ms     : int  149610 210826 201933 198853 214240 229400 242946 189613 205594 244800 ...
##  $ danceability    : num  0.42 0.438 0.266 0.618 0.688 0.407 0.703 0.625 0.442 0.627 ...
##  $ energy          : num  0.166 0.359 0.0596 0.443 0.481 0.147 0.444 0.414 0.632 0.363 ...
##  $ key             : int  1 0 0 2 6 2 11 0 1 8 ...
##  $ loudness        : num  -17.23 -9.73 -18.52 -9.68 -8.81 ...
##  $ mode            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ speechiness     : num  0.0763 0.0557 0.0363 0.0526 0.105 0.0355 0.0417 0.0369 0.0295 0.0291 ...
##  $ acousticness    : num  0.924 0.21 0.905 0.469 0.289 0.857 0.559 0.294 0.426 0.279 ...
##  $ instrumentalness: num  5.56e-06 0.00 7.07e-05 0.00 0.00 2.89e-06 0.00 0.00 4.19e-03 0.00 ...
##  $ liveness        : num  0.101 0.117 0.132 0.0829 0.189 0.0913 0.0973 0.151 0.0735 0.0928 ...
##  $ valence         : num  0.267 0.12 0.143 0.167 0.666 0.0765 0.712 0.669 0.196 0.301 ...
##  $ tempo           : num  77.5 76.3 181.7 119.9 98 ...
##  $ time_signature  : int  4 4 3 4 4 3 4 4 4 4 ...
##  $ duration_min    : num  2.49 3.51 3.37 3.31 3.57 3.82 4.05 3.16 3.43 4.08 ...
##  $ popularity_t    : num  1 1 1 1 1 1 1 1 1 1 ...
str(mode0)
## 'data.frame':    41319 obs. of  16 variables:
##  $ popularity      : int  73 52 54 0 1 0 0 0 0 46 ...
##  $ duration_ms     : int  230666 198712 169728 231266 302346 273653 257493 257493 266960 213098 ...
##  $ danceability    : num  0.676 0.489 0.795 0.796 0.755 0.633 0.409 0.409 0.568 0.596 ...
##  $ energy          : num  0.461 0.314 0.0841 0.667 0.454 0.429 0.153 0.153 0.301 0.2 ...
##  $ key             : int  1 7 10 5 9 4 6 6 10 0 ...
##  $ loudness        : num  -6.75 -9.24 -18.09 -4.83 -9.61 ...
##  $ mode            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ speechiness     : num  0.143 0.0331 0.0461 0.0392 0.0352 0.0381 0.0306 0.0306 0.0373 0.0305 ...
##  $ acousticness    : num  0.0322 0.749 0.742 0.381 0.757 0.0444 0.939 0.939 0.796 0.91 ...
##  $ instrumentalness: num  1.01e-06 0.00 1.17e-05 0.00 0.00 0.00 2.56e-05 2.56e-05 3.37e-06 1.83e-04 ...
##  $ liveness        : num  0.358 0.113 0.0853 0.221 0.236 0.132 0.108 0.108 0.118 0.0884 ...
##  $ valence         : num  0.715 0.607 0.609 0.754 0.33 0.52 0.18 0.18 0.142 0.308 ...
##  $ tempo           : num  87.9 124.2 91.8 98 120.1 ...
##  $ time_signature  : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ duration_min    : num  3.84 3.31 2.83 3.85 5.04 4.56 4.29 4.29 4.45 3.55 ...
##  $ popularity_t    : num  1 1 1 0 0 0 0 0 0 0 ...
ttest_mode <- t.test(mode1$popularity, mode0$popularity)
ttest_mode
## 
##  Welch Two Sample t-test
## 
## data:  mode1$popularity and mode0$popularity
## t = -4.6709, df = 84062, p-value = 3.003e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9176289 -0.3751582
## sample estimates:
## mean of x mean of y 
##  33.00425  33.65064
aovtime_signature <- aov(popularity ~ time_signature, data = data_corr)
summary(aovtime_signature)
##                    Df   Sum Sq Mean Sq F value Pr(>F)    
## time_signature      1    54761   54761   110.2 <2e-16 ***
## Residuals      113998 56661626     497                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These are the results of statistical tests performed with the original popularity score and the categorical variables mode and time signature, to compare mean popularity across categories. For mode, which has two values, we performed a t-test: the null hypothesis is that mean popularity is the same for both modes, and the alternative is that the means differ by mode. For time signature, we used an ANOVA since it has five categories: the null hypothesis is that mean popularity is the same across time signatures, and the alternative is that it is not. In both cases the p-value is small enough to reject the null hypothesis.
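Note that the aov() call above used time_signature as an integer (the output shows Df = 1), which fits a linear trend rather than a five-group ANOVA. Converting it to a factor gives the genuine multi-group test, and Tukey's HSD then shows which levels differ; a sketch on a toy data frame standing in for data_corr:

```r
# Toy data: three time-signature groups with slightly different mean popularity.
set.seed(1)
toy <- data.frame(popularity = c(rnorm(50, 30), rnorm(50, 34), rnorm(50, 33)),
                  time_signature = factor(rep(c(3, 4, 5), each = 50)))

aov_fit <- aov(popularity ~ time_signature, data = toy)  # factor => Df = 2 here
TukeyHSD(aov_fit)  # pairwise mean differences with adjusted p-values
```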

6. Model Building

6.1 Feature Selection

We made an effort to decide which features to include before developing a model, so we examined the Mallows' Cp statistic and performed best-subset feature selection. However, when we chose features based on the lowest Cp value, we obtained a suspiciously high accuracy score, indicating that the model was over-fitted. We therefore decided to construct the full model with all the features.

library("leaps")
reg2.best <- regsubsets(popularity_t~. , data = data_corr, nvmax = 12, nbest = 3, method = "exhaustive") 
plot(reg2.best, scale = "Cp", main = "Cp")
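For reference, Mallows' Cp for a candidate subset is RSS_p / sigma-squared(full) + 2p - n, and a well-fitting subset has Cp close to p. A minimal hand-computed sketch on the built-in mtcars data (the Spotify frame is not reproduced here); mallows_cp is our own illustrative helper mirroring what regsubsets() reports:

```r
# Full model and one candidate subset on the built-in mtcars data.
full <- lm(mpg ~ ., data = mtcars)
cand <- lm(mpg ~ wt + hp, data = mtcars)

mallows_cp <- function(model, full_model) {
  n      <- nrow(model.frame(model))
  sigma2 <- summary(full_model)$sigma^2  # error variance estimate, full model
  p      <- length(coef(model))          # parameters incl. intercept
  sum(resid(model)^2) / sigma2 + 2 * p - n
}

mallows_cp(cand, full)  # close to p = 3 for a good two-predictor subset
```

By construction the full model's Cp equals its own parameter count, which is a handy sanity check on the formula.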

6.2 Splitting the data

To prepare the data for training, we split it into training and test sets and separate the features from the target label. Our target label is popularity_t, and the features are all other variables available in the dataset.

set.seed(123)
dat.d <- sample(1:nrow(data_corr),size=nrow(data_corr)*0.8,replace = FALSE) # random selection of 80% of the data.
data_corr.a = subset(data_corr, select = c(-popularity,-duration_ms) )
 
 
train <- data_corr.a[dat.d,] # 80% training data
test <- data_corr.a[-dat.d,] # remaining 20% test data
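A simple random split like the one above only preserves the roughly 25/75 class ratio on average. A stratified split preserves it exactly, which can matter with imbalanced labels; a base-R sketch on a toy frame standing in for data_corr:

```r
# Toy frame with the same 25/75 class imbalance as popularity_t.
set.seed(123)
toy <- data.frame(x = rnorm(1000), popularity_t = rep(c(1, 0), c(250, 750)))

# Sample 80% of the row indices within each class, then combine.
idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$popularity_t),
                     function(i) sample(i, length(i) * 0.8)))
train_s <- toy[idx, ]
test_s  <- toy[-idx, ]

mean(train_s$popularity_t)  # 0.25, same ratio as the full data
```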

6.3 Which Model to use?

This is a classification problem: we need to predict whether a song will be popular (1) or not popular (0). Of the many algorithms available for classification problems, we apply the three below.

  1. KNN Algorithm
  2. Logistic Regression
  3. Decision Trees

To answer our SMART question, “Can we recommend certain songs to users who like to listen to popular songs?”, we use the above classification models to separate popular and unpopular songs based on the features we analyzed. The model with the highest accuracy will be adopted.

6.4 KNN

The first model we tried to build is KNN.

library(class)
prc_train_labels <- train[,14]
prc_test_labels<- test[,14]

6.4.1. Selecting the correct “k”

How does “k” affect classification accuracy? Let’s create a function that computes classification accuracy for a given value of “k.”

chooseK = function(k, train_set, val_set, train_class, val_class){
  
  # Build knn with k neighbors considered.
  set.seed(1)
  class_knn = knn(train = train_set,    #<- training set cases
                  test = val_set,       #<- test set cases
                  cl = train_class,     #<- category for classification
                  k = k) #,                #<- number of neighbors considered
                  # use.all = TRUE)       #<- control ties between class assignments. If true, all distances equal to the k-th largest are included
  
  tab = table(class_knn, val_class)
  
  # Calculate the accuracy.
  accu = sum(tab[row(tab) == col(tab)]) / sum(tab)                         
  cbind(k = k, accuracy = accu)
}

# The sapply() function plugs in several values into our chooseK function.
# function(x)[function] allows you to apply a series of numbers
# to a function without running a for() loop.
knn_different_k = sapply(seq(3, 15, by = 2),  #<- set k to the odd numbers from 3 to 15
                         function(x) chooseK(x, 
                                             train_set = train,
                                             val_set = test,
                                             train_class = prc_train_labels,
                                             val_class = prc_test_labels))

# Reformat the results to graph the results.
str(knn_different_k)
##  num [1:2, 1:7] 3 0.926 5 0.92 7 ...
knn_different_k = data.frame(k = knn_different_k[1,],
                             accuracy = knn_different_k[2,])

# Plot accuracy vs. k.
# install.packages("ggplot2")
loadPkg("ggplot2")

ggplot(knn_different_k,
       aes(x = k, y = accuracy)) +
  geom_line(color = "steelblue1", size = 1.5) +
  geom_point(size = 3) + 
  labs(title = "accuracy vs k")

From the results above, we can observe that k = 3 gives the maximum accuracy. We now evaluate the model’s performance with a confusion matrix.

prc_test_pred <- knn(train = train, test = test,cl = prc_train_labels, k=3)

6.4.2 Confusion Matrix

cm_knn<-confusionMatrix(as.factor(prc_test_labels),as.factor(prc_test_pred))
cm_knn
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 16767   468
##          1  1211  4354
##                                           
##                Accuracy : 0.9264          
##                  95% CI : (0.9229, 0.9297)
##     No Information Rate : 0.7885          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.791           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9326          
##             Specificity : 0.9029          
##          Pos Pred Value : 0.9728          
##          Neg Pred Value : 0.7824          
##              Prevalence : 0.7885          
##          Detection Rate : 0.7354          
##    Detection Prevalence : 0.7559          
##       Balanced Accuracy : 0.9178          
##                                           
##        'Positive' Class : 0               
## 

This model’s overall accuracy is 92.64%, which compares favorably with the other two models. Here, 16767 songs were correctly predicted as unpopular and 4354 were correctly predicted as popular, while 468 unpopular songs were misclassified as popular and 1211 popular songs were misclassified as unpopular. (Note that confusionMatrix() was called as (labels, predictions), so the “Prediction” axis in the printout is actually the true label.)
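One caveat worth noting: knn() classifies by Euclidean distance, so a feature on a large scale (e.g., duration in milliseconds) can dominate the 0-1 audio features unless the columns are standardized first. A minimal sketch of z-scaling before knn(), on toy data with two deliberately mismatched scales:

```r
library(class)  # same package the report already uses for knn()

# Toy features: one on a 0-1 scale, one roughly five orders of magnitude larger.
set.seed(1)
toy_x <- data.frame(small = runif(100), big = runif(100) * 1e5)
toy_y <- factor(rep(c("a", "b"), 50))

scaled <- scale(toy_x)  # each column now has mean 0 and sd 1

# Train on the first 80 rows, predict the last 20.
pred <- knn(train = scaled[1:80, ], test = scaled[81:100, ],
            cl = toy_y[1:80], k = 3)
length(pred)  # one predicted label per test row
```

On the Spotify features, the same scale() step before the knn() call above would put duration, loudness, and tempo on equal footing with the 0-1 features.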

6.5 Logistic Regression Model

Next, we are building the logistic regression model.

model = glm(popularity_t~., data=train, family = binomial)  # family = binomial for logistic regression

6.5.1 Confusion Matrix

predict_value = predict(model, test, type = "response")  # predicted probability that a song is popular

predict_test_value = ifelse((predict_value > 0.5), 1, 0)

# Compare the predicted values with the actual labels
Compare_values = cbind(predict_test_value, test[,14])
Compare_values = data.frame(Compare_values)
cm_lg<-caret::confusionMatrix(as.factor(Compare_values[,1]),as.factor(Compare_values[,2]))
cm_lg
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 17235  5565
##          1     0     0
##                                           
##                Accuracy : 0.7559          
##                  95% CI : (0.7503, 0.7615)
##     No Information Rate : 0.7559          
##     P-Value [Acc > NIR] : 0.5036          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.7559          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.7559          
##          Detection Rate : 0.7559          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 

As the confusion matrix shows, this model never predicts the popular class: every song in the test set is classified as unpopular. The resulting accuracy of roughly 75.59% is exactly the no-information rate, and Kappa is 0, so the model does no better than always guessing the majority class. The model summary (not shown) indicates that, with the exception of the key variable, the other variables have coefficients significantly related to popularity.
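With a 75/25 class split, a fixed 0.5 cutoff can collapse logistic regression onto the majority class; one simple remedy is to lower the cutoff toward the positive-class prevalence. A sketch on simulated data (self-contained, so it does not use the report's train/test objects):

```r
# Simulated imbalanced problem: about 25% positives, one informative feature.
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1.5 + x))

fit  <- glm(y ~ x, family = binomial)
prob <- predict(fit, type = "response")  # probabilities, not log-odds

# Default 0.5 cutoff vs. a cutoff at the observed positive rate.
table(default = prob > 0.5, lowered = prob > mean(y))
```

Lowering the cutoff trades some false alarms for the ability to flag popular songs at all, which matters for a recommendation use case.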

6.6 Decision Tree

Lastly, we tried to build a decision tree model.

library(rpart)
library(rpart.plot)

fit <- rpart(popularity_t~., data = train, method = 'class')

6.6.1 Confusion Matrix

predict_unseen <-predict(fit, test, type = 'class')
table_mat <- table(test$popularity_t, predict_unseen)
accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
print(paste('Accuracy for test', accuracy_Test))
## [1] "Accuracy for test 0.755921052631579"
cm_dt<-confusionMatrix(as.factor(test$popularity_t),as.factor(predict_unseen))
cm_dt
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 17235     0
##          1  5565     0
##                                           
##                Accuracy : 0.7559          
##                  95% CI : (0.7503, 0.7615)
##     No Information Rate : 1               
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7559          
##             Specificity :     NA          
##          Pos Pred Value :     NA          
##          Neg Pred Value :     NA          
##              Prevalence : 1.0000          
##          Detection Rate : 0.7559          
##    Detection Prevalence : 0.7559          
##       Balanced Accuracy :     NA          
##                                           
##        'Positive' Class : 0               
## 

Like the logistic regression model, the decision tree classifies every song as unpopular: all 5565 popular songs in the test set are misclassified. Its overall accuracy of 75.59% therefore equals the majority-class baseline rather than reflecting genuine discriminative power.
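rpart can be pushed away from the all-majority solution with a loss matrix that makes missing a popular song costlier than a false alarm. A sketch on toy data (the 4:1 cost ratio is an illustrative choice, not a value from the report):

```r
library(rpart)  # same package the report already uses

# Toy imbalanced problem with one informative feature.
set.seed(1)
toy <- data.frame(x = rnorm(400))
toy$cls <- factor(ifelse(toy$x + rnorm(400) > 0.7, 1, 0))

# Loss matrix (rows = true class, cols = predicted): misclassifying a
# true "1" as "0" costs 4, the reverse error costs 1.
fit_w <- rpart(cls ~ x, data = toy, method = "class",
               parms = list(loss = matrix(c(0, 4, 1, 0), nrow = 2)))

table(predict(fit_w, toy, type = "class"))
```

On the Spotify split, the same parms argument added to the rpart() call above would be one way to obtain a tree that actually predicts some songs as popular.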

6.7 Comparing all the models

model_compare <- data.frame(Model = c('KNN',
                                      'Decision Tree',
                                      'Logistic Regression'),
                            Accuracy = c((cm_knn$overall[1])*100,
                                         (cm_dt$overall[1])*100,
                                         (cm_lg$overall[1])*100))

print(model_compare)
##                 Model Accuracy
## 1                 KNN 92.63596
## 2       Decision Tree 75.59211
## 3 Logistic Regression 75.59211
ggplot(aes(x=Model, y=Accuracy), data=model_compare) +
    geom_bar(stat='identity', fill = 'steelblue1') +
    ggtitle('Comparative Accuracy of Models on Cross-Validation Data') +
    xlab('Models') +
    ylab('Overall Accuracy')

Comparing all three models, we conclude that KNN is the best algorithm for classifying popular and unpopular songs, with 92.64% accuracy. Therefore, we use KNN to answer our SMART question in the conclusion.
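Raw accuracy hides the fact that two of the models never predict the positive class. Balanced accuracy makes that visible, and it can be computed by hand from a confusion matrix; here we reuse the KNN counts printed in 6.4.2:

```r
# KNN confusion-matrix counts from section 6.4.2 (columns = reference class).
cm <- matrix(c(16767, 1211, 468, 4354), nrow = 2,
             dimnames = list(pred = c(0, 1), ref = c(0, 1)))

sens <- cm[1, 1] / sum(cm[, 1])  # sensitivity for class 0
spec <- cm[2, 2] / sum(cm[, 2])  # specificity (class 1 recall)

(sens + spec) / 2  # balanced accuracy, about 0.918 for KNN
```

For the logistic and tree models, the same calculation gives 0.5, the value a coin flip would achieve, which sharpens the contrast with KNN beyond the accuracy table above.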

7. Conclusion

The main question we tried to answer with these models is “Can we recommend certain songs to users who like to listen to popular songs?”. According to our analysis, the answer is “YES”. Using the KNN model we built, we can classify popular and unpopular songs, so this prediction model could be used to recommend songs to Spotify users who prefer to listen only to popular music.
Through the exploratory data analysis, we identified the audio features that affect song popularity, answering the first question. All the numerical variables, such as danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, and tempo, affect the popularity of songs, since each has a statistically significant relationship with popularity. The categorical variables key, mode, and time signature are also dependent on popularity according to the chi-squared tests.
In a future project, we would like to include features we could not explore this time, such as artist name or genre, which heavily influence a song's popularity, to acquire further insights.